Abstract: Real-time object detection has many applications in
video surveillance, teleconferencing and multimedia retrieval.
Since Viola and Jones [1] proposed the first real-time AdaBoost
based face detection system, much effort has been spent on
improving the boosting method. In this work, we first show that
feature selection methods other than boosting can also be used for
training an efficient object detector. In particular, we introduce
Greedy Sparse Linear Discriminant Analysis (GSLDA) [2] for its
conceptual simplicity and computational efficiency; slightly
better detection performance is achieved compared with [1].
Moreover, we propose a new technique, termed Boosted Greedy
Sparse Linear Discriminant Analysis (BGSLDA), to efficiently
train a detection cascade. BGSLDA exploits the sample
re-weighting property of boosting and the class-separability
criterion of GSLDA. Experiments in the domain of highly skewed data
distributions, e.g., face detection, demonstrate that classifiers
trained with the proposed BGSLDA outperform AdaBoost and
its variants. This finding shows that AdaBoost and similar
approaches are not the only methods that can achieve high
classification results for high dimensional data in object
detection.

Index Terms: Object detection, AdaBoost, asymmetry, greedy
sparse linear discriminant analysis, feature selection, cascade 
classifier. 

I. Introduction

REAL-TIME object detection such as face detection
has numerous computer vision applications, e.g., intelligent
video surveillance, vision based teleconference systems
and content based image retrieval. Various detectors have been
proposed in the literature Ql, O, IH. Object detection is chal- 
lenging due to the variations of the visual appearances, poses 
and illumination conditions. Furthermore, object detection is a 
highly imbalanced classification task. A typical natural image
contains many more negative background patterns than object
patterns. The number of background patterns can be 100,000
times larger than the number of object patterns. That means, if
one wants to achieve a high detection rate, together with a low
false detection rate, one needs a specific classifier. The cascade
classifier takes this imbalanced distribution into consideration
[1]. Because of the huge success of Viola and Jones' real-time
AdaBoost based face detector, much incremental
work has been proposed. Most of it has focused on
improving the underlying boosting method or accelerating the 
training process. For example, AsymBoost was introduced in
[5] to alleviate the limitation of AdaBoost in the context of
highly skewed example distributions. Li et al. [6] proposed
FloatBoost for better detection accuracy by introducing a
backward feature elimination step into the AdaBoost training
procedure. Wu et al. [7] used forward feature selection for fast
training by ignoring the re-weighting scheme in AdaBoost.
Another technique based on the statistics of the weighted
input data was used in [8] for even faster training. KLBoost
was proposed in [9] to train a strong classifier. The weak
classifiers of KLBoost are based on histogram divergence of
linear features. Therefore, in the detection phase, it is not as
efficient as Haar-like features. Notice that in KLBoost, the
classifier design is separated from feature selection. In this
work (part of which was published in preliminary form in
[10]), we propose an improved learning algorithm for face
detection, dubbed Boosted Greedy Sparse Linear Discriminant
Analysis (BGSLDA).

Viola and Jones [1] introduced a framework for selecting
discriminative features and training classifiers in a cascaded
manner as shown in Fig. 1. The cascade framework allows
most non-face patches to be rejected quickly before reaching
the final node, resulting in fast performance. A test image
patch is reported as a face only if it passes the tests in all nodes.
This way, most non-face patches are rejected by the early
nodes. Cascade detectors lead to very fast detection speed
and high detection rates. Cascade classifiers have also been
used in the context of support vector machines (SVMs) for
faster face detection [4]. In [11], a soft-cascade is developed
to reduce the training and design complexity. The idea was
further developed in [12]. We have followed Viola and Jones'
original cascade classifiers in this work.
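
The early-rejection behaviour of a cascade can be illustrated with a short Python sketch. It is not the detector used in this work: the node representation (a scoring function paired with a threshold) and the names `cascade_predict` and `nodes_evaluated` are assumptions made only for this example.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical node representation: (scoring function, node threshold).
# A patch passes a node when its score reaches the node threshold.
Node = Tuple[Callable[[Sequence[float]], float], float]


def cascade_predict(nodes: List[Node], patch: Sequence[float]) -> bool:
    """Return True only if the patch passes the test at every node."""
    for score_fn, threshold in nodes:
        if score_fn(patch) < threshold:
            return False  # early rejection: later nodes are never evaluated
    return True


def nodes_evaluated(nodes: List[Node], patch: Sequence[float]) -> int:
    """Count how many nodes are evaluated before a decision is reached."""
    count = 0
    for score_fn, threshold in nodes:
        count += 1
        if score_fn(patch) < threshold:
            break
    return count


# Toy two-node cascade on a two-element "patch" (illustrative values only).
toy_nodes: List[Node] = [(lambda p: sum(p), 1.0), (lambda p: max(p), 0.8)]
```

A patch rejected by the first node costs a single node evaluation, which is why most background patches are cheap to process.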

One issue that contributes to the efficacy of the system
comes from the use of the AdaBoost algorithm for training
cascade nodes. AdaBoost is a forward stage-wise additive
modeling method with the weighted exponential loss function. The
algorithm combines an ensemble of weak classifiers to produce
a final strong classifier with high classification accuracy.
AdaBoost chooses a small subset of weak classifiers and
assigns them proper coefficients. The linear combination
of weak classifiers can be interpreted as a decision hyperplane
in the weak classifier space. The proposed BGSLDA
differs from the original AdaBoost in the following aspects.
Instead of selecting decision stumps with minimal weighted
error as in AdaBoost, the proposed algorithm finds a new
weak learner that maximizes the class-separability criterion.
Fig. 1. A cascade classifier with multiple nodes. Here a circle represents a
node classifier. An input patch is classified as a target only when it passes 
tests at each node classifier. 



As a result, the coefficients of selected weak classifiers are 
updated repetitively during the learning process according to 
this criterion. 

Our technique differs from [7] in the following aspects.
[7] proposed the concept of the Linear Asymmetric Classifier
(LAC) by addressing the asymmetries and the asymmetric node
learning goal in the cascade framework. Unlike our work,
where the features are selected based on the Linear Discriminant
Analysis (LDA) criterion, [7] selects features using the
AdaBoost/AsymBoost algorithm. Given the selected features,
Wu et al. then build an optimal linear classifier for the node
learning goal using LAC or LDA. Note that similar techniques
have also been applied in neural networks. In [13], a nonlinear
adaptive feed-forward layered network with linear output units
was introduced. The input data is nonlinearly transformed
into a space in which classes can be separated more easily.
Since LDA considers the number of training samples of each
class, applying LDA at the output of neural network hidden
units has been shown to increase the classification accuracy
of two-class problems with unequal class membership. As our
experiments show, in terms of feature selection, the proposed
BGSLDA method is superior to AdaBoost and AsymBoost
for object detection.

The key contributions of this work are as follows. 

• We introduce GSLDA as an alternative approach for train- 
ing face detectors. Similar results are obtained compared 
with Viola and Jones' approach. 

• We propose a new algorithm, BGSLDA, which combines 
the sample re-weighting schemes typically used in boost- 
ing into GSLDA. Experiments show that BGSLDA can 
achieve better detection performances. 

• We show that feature selection and classifier training 
techniques can have different objective functions (in 
other words, the two processes can be separated) in the 
context of training a visual detector. This offers more 
flexibility and even better performance. Previous boosting 
based approaches select features and train a classifier 
simultaneously. 

• Our results confirm that it is beneficial to consider the 
highly skewed data distribution when training a detector. 
LDA's learning criterion already incorporates this imbal- 
anced data information. Hence it is better than standard 
AdaBoost's exponential loss for training an object detec- 
tor. 

The remaining parts of the paper are structured as follows.
In Section II-A, the GSLDA algorithm is introduced as an
alternative learning technique to object detection problems.
We then discuss how LDA incorporates imbalanced data
information in Section II-B. In Sections II-C and II-D, the
proposed BGSLDA algorithm
is described and the training time complexity is discussed.
Experimental results are shown in Section III and the paper is
concluded in Section IV.

II. Algorithms 

In this section, we present alternative techniques to AdaBoost
for object detection. We start with a short explanation
of the concept of GSLDA [14]. Next, we show that, like
AsymBoost [5], LDA is better at handling asymmetric data
than AdaBoost. We also propose a new algorithm that
makes use of the sample re-weighting scheme commonly used
in AdaBoost to select a subset of relevant features for training
the GSLDA classifier. Finally, we analyze the training time
complexity of the proposed method.

A. Greedy Sparse Linear Discriminant Analysis 

Linear Discriminant Analysis (LDA) can be cast as a
generalized eigenvalue decomposition. Given a pair of symmetric
matrices corresponding to the between-class ($S_b$) and
within-class ($S_w$) covariance matrices, one maximizes a
class-separability criterion defined by the generalized Rayleigh
quotient:

$$\max_{w} \; \frac{w^\top S_b w}{w^\top S_w w}. \qquad (1)$$

The optimal solution of a generalized Rayleigh quotient is
the eigenvector corresponding to the maximal eigenvalue. The
sparse version of LDA is to solve (1) with an additional
sparsity constraint:

$$\mathrm{Card}(w) = k, \qquad (2)$$

where $\mathrm{Card}(\cdot)$ counts the number of nonzero components,
a.k.a. the $\ell_0$ norm, and $k \in \mathbb{Z}^+$ is an integer set by a user. Due to
this sparsity constraint, the problem becomes non-convex and
NP-hard. In [14], Moghaddam et al. presented a technique to
compute optimal sparse linear discriminants using a branch and
bound approach. Nevertheless, finding the exact global optimal
solutions for high dimensional data is infeasible. The algorithm
was extended in [2], with new sparsity bounds and efficient
matrix inverse techniques to speed up the computation time
by 1000×. The technique works by sequentially adding the
new variable which yields the maximum eigenvalue (forward
selection) until the maximum number of elements are selected
or some predefined condition is met. As shown in [2], for
two-class problems, the computation can be made very efficient
as the only finite eigenvalue $\lambda_{\max}(S_b, S_w)$ can be computed
in closed form as $b^\top S_w^{-1} b$ with $S_b = b b^\top$, because in this
case $S_b$ is a rank-one matrix and $b$ is a column vector. Therefore,
the computation is mainly determined by the inverse of $S_w$.
When a greedy approach is adopted to sequentially find the
suboptimal $w$, a simple rank-one update for computing $S_w^{-1}$
significantly reduces the computation complexity [2]. We have
mainly used forward greedy search in this work. For forward
greedy search, if $l$ is the current subset of $k$ indices and
$m = l \cup i$ for a candidate $i$ which is not in $l$, the new augmented
inverse $(S_w^m)^{-1}$ can be calculated in a fast way by recycling
the last step's result $(S_w^l)^{-1}$:

$$(S_w^m)^{-1} = \begin{bmatrix} (S_w^l)^{-1} + a_i u_i u_i^\top & -a_i u_i \\ -a_i u_i^\top & a_i \end{bmatrix}, \qquad (3)$$

where $u_i = (S_w^l)^{-1} S_{w,li}$ with $(li)$ indexing the $l$ rows and
$i$-th column of $S_w$, and $a_i = 1/(S_{w,ii} - S_{w,li}^\top u_i)$ IHI, 13.
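
The augmented inverse can be checked numerically. The following pure-Python sketch assumes that the update is the standard block (Schur complement) inverse of a symmetric matrix grown by one row and column; the function names `augment_inverse` and `matmul` and the small test matrix are illustrative and not taken from [2].

```python
from typing import List

Matrix = List[List[float]]


def augment_inverse(inv_l: Matrix, s_li: List[float], s_ii: float) -> Matrix:
    """Inverse of [[S_l, s_li], [s_li^T, s_ii]] given inv_l = S_l^{-1}.

    S_l is assumed symmetric, so inv_l is symmetric as well.
    """
    k = len(inv_l)
    # u_i = S_l^{-1} s_li
    u = [sum(inv_l[r][c] * s_li[c] for c in range(k)) for r in range(k)]
    # a_i = 1 / (s_ii - s_li^T u_i), the inverse of the Schur complement
    a = 1.0 / (s_ii - sum(s_li[r] * u[r] for r in range(k)))
    out = [[inv_l[r][c] + a * u[r] * u[c] for c in range(k)] + [-a * u[r]]
           for r in range(k)]
    out.append([-a * u[c] for c in range(k)] + [a])
    return out


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product, used here only to verify the result."""
    return [[sum(x[r][j] * y[j][c] for j in range(len(y)))
             for c in range(len(y[0]))] for r in range(len(x))]


# Grow the inverse of a small symmetric positive definite matrix
# one variable at a time, starting from its 1 x 1 leading block.
S = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]]
inv = [[1.0 / S[0][0]]]
inv = augment_inverse(inv, [S[0][1]], S[1][1])
inv = augment_inverse(inv, [S[0][2], S[1][2]], S[2][2])
```

Each call costs O(k^2) for a current subset of size k, instead of the O(k^3) needed to invert the augmented matrix from scratch.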

Note that we have experimented with other sparse linear
regression and classification algorithms, e.g., $\ell_1$-norm linear
support vector machines, $\ell_1$-norm regularized log-linear models,
etc. However, the major drawback of these techniques is
that they do not have an explicit parameter that controls the
number of features to be selected. The trade-off parameter
(regularization parameter) only controls the degree of sparseness.
One has to tune this parameter using cross-validation.
Also, $\ell_1$ penalty methods often lead to sub-optimal sparsity
[16]. Hence, we have decided to apply GSLDA, which makes
use of greedy feature selection and allows the number of features
to be predefined. It would be of interest to compare our method
with $\ell_1$-norm induced sparse models [17].

The following paragraph explains how we apply the GSLDA
classifier [2] as an alternative feature selection method to
the classical Viola and Jones framework [1].

Due to space limits, we omit the explanation of cascade
classifiers. Interested readers should refer to [1], [7] for details.
The GSLDA object detector operates as follows. The set of
selected features is initialized to an empty set. The first step
(lines 4-5) is to train weak classifiers, for example, decision
stumps on Haar features.^1 For each Haar-like rectangle feature,
the threshold that gives the minimal classification error is
stored in a lookup table. In order to achieve maximum
class separation, the output of each decision stump is examined
and the decision stump whose output yields the maximum
eigenvalue is sequentially added to the list (line 7, step (1)).
The process continues until the predefined condition is met
(line 6).

The proposed GSLDA based detection framework is summarized
in Algorithm 1.
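
The weak classifier training step (lines 4-5) can be sketched as follows. This is a minimal illustration of a decision stump on a single feature, assuming labels in {+1, -1} and a non-empty sample set; the polarity parameter and the function names `train_stump` and `stump_predict` are assumptions of this sketch rather than details given in the text.

```python
from typing import Sequence, Tuple


def train_stump(values: Sequence[float], labels: Sequence[int],
                weights: Sequence[float]) -> Tuple[float, int, float]:
    """Return (threshold, polarity, weighted_error) for one feature.

    The stump predicts +1 when polarity * (value - threshold) > 0.
    Sorting dominates the cost, giving O(N log N) per feature.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    total_pos = sum(w for w, y in zip(weights, labels) if y == 1)
    total_neg = sum(w for w, y in zip(weights, labels) if y == -1)

    # Start with a threshold below every value: polarity +1 predicts all +1
    # (error = total_neg), polarity -1 predicts all -1 (error = total_pos).
    low = values[order[0]] - 1.0
    best = (low, 1, total_neg)
    if total_pos < total_neg:
        best = (low, -1, total_pos)

    pos_below = 0.0  # weight of positives at or below the candidate threshold
    neg_below = 0.0  # weight of negatives at or below the candidate threshold
    for rank, i in enumerate(order):
        if labels[i] == 1:
            pos_below += weights[i]
        else:
            neg_below += weights[i]
        is_last = rank + 1 == len(order)
        if not is_last and values[order[rank + 1]] == values[i]:
            continue  # cannot split between equal feature values
        if is_last:
            thr = values[i] + 1.0
        else:
            thr = 0.5 * (values[i] + values[order[rank + 1]])
        err_plus = pos_below + (total_neg - neg_below)   # above threshold -> +1
        err_minus = neg_below + (total_pos - pos_below)  # below threshold -> +1
        if err_plus < best[2]:
            best = (thr, 1, err_plus)
        if err_minus < best[2]:
            best = (thr, -1, err_minus)
    return best


def stump_predict(value: float, threshold: float, polarity: int) -> int:
    """Binary output of a trained stump for one feature value."""
    return 1 if polarity * (value - threshold) > 0 else -1
```

With uniform weights this corresponds to the plain classification error used by GSLDA; passing non-uniform weights gives the weighted error used when stumps are re-trained under boosting.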

B. Linear Discriminant Analysis on Asymmetric Data 

In the cascade classifiers, we would prefer to have a
classifier that yields high detection rates without introducing
many false positives. Binary variables (decision stump outputs)
follow the Bernoulli distribution and it can be easily shown that
the log likelihood ratio is a linear function. In the Bayes sense,
linear classifiers are optimum for normal distributions with
equal covariance matrices. However, due to their simplicity and
robustness, linear classifiers have been shown to perform well not only
for normal distributions with unequal covariance matrices but
also for non-normal distributions. A linear classifier can be written

^1 We introduce nonlinearity into our system by applying decision stump
learning to raw Haar feature values. By nonlinearly transforming the data, the
input can now be separated more easily using simple linear classifiers.

Note that any classifier can be applied here. We also use LDA on
covariance features for human detection as described in [18]. For the time being,
we focus on decision stumps on Haar-like features. We will give details about
covariance features later.



Algorithm 1 The training procedure for building a cascade of
GSLDA object detector.

Input:

• A positive training set and a negative training set;

• A set of Haar-like rectangle features $h_1, h_2, \dots$;

• $D_{\min}$: minimum acceptable detection rate per cascade level;

• $F_{\max}$: maximum acceptable false positive rate per cascade
level;

• $F_{\text{target}}$: target overall false positive rate;

1 Initialize: $i = 0$; $D_i = 1$; $F_i = 1$;
2 while $F_{\text{target}} < F_i$ do
3     $i = i + 1$; $f_i = 1$;
4     foreach feature do
5         Train a weak classifier (e.g., a decision stump
          parameterized by a threshold $\theta$) with the smallest error
          on the training set;
6     while $f_i > F_{\max}$ do
7         1. Add the best weak classifier (e.g., decision stump)
          that yields the maximum class separation;
8         2. Lower classifier threshold such that $D_{\min}$ holds;
9         3. Update $f_i$ using this classifier threshold;
10    $D_{i+1} = D_i \times D_{\min}$; $F_{i+1} = F_i \times f_i$; and remove correctly
      classified negative samples from the training set;
11    if $F_{\text{target}} < F_i$ then
12        Evaluate the current cascaded classifier on the negative
          images and add misclassified samples into the negative
          training set;

Output:

• A cascade of classifiers for each cascade level $i = 1, \dots$;

• Final training accuracy: $F_i$ and $D_i$;



as

$$F(x) = \begin{cases} +1 & \text{if } \sum_{t=1}^{n} w_t h_t(x) \ge \theta, \\ -1 & \text{otherwise,} \end{cases} \qquad (4)$$

where $h(\cdot)$ defines a function which returns a binary outcome,
$x$ is the input image features and $\theta$ is an optimal threshold
such that the minimum number of examples are misclassified.
In this paper, our linear classifier is the summation of decision
stump classifiers. By the central limit theorem, the output of the
linear classifier is close to a normal distribution for large $n$.
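
The bookkeeping of Algorithm 1, where the overall rates are products of the per-level rates, can be made concrete with a small Python sketch. The numeric values (a per-level detection rate of 0.99, a per-level false positive rate of 0.5 and a target of 0.001) are illustrative assumptions, not settings reported in this paper, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def cascade_rates(d_min: float, node_fprs: Sequence[float]) -> List[Tuple[float, float]]:
    """Overall (detection rate, false positive rate) after each cascade level."""
    d, f = 1.0, 1.0
    history = []
    for fp in node_fprs:
        d *= d_min  # D_{i+1} = D_i * D_min
        f *= fp     # F_{i+1} = F_i * f_i
        history.append((d, f))
    return history


def levels_needed(f_max: float, f_target: float) -> int:
    """Smallest number of levels such that f_max ** levels <= f_target,
    assuming every level exactly reaches the per-level rate f_max."""
    if not 0.0 < f_max < 1.0:
        raise ValueError("f_max must lie strictly between 0 and 1")
    levels, f = 0, 1.0
    while f > f_target:
        f *= f_max
        levels += 1
    return levels
```

Under these assumed values, `levels_needed(0.5, 1e-3)` shows how many levels are required before the product of per-level false positive rates falls below the target, while `cascade_rates` shows how the overall detection rate shrinks with every added level. This is why each node must keep a very high detection rate.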

The asymmetric goal for training cascade classifiers can be
written as a trade-off between the false acceptance rate $\varepsilon_1$ and
the false rejection rate $\varepsilon_2$ as

$$r = \varepsilon_1 + \mu \varepsilon_2, \qquad (5)$$

where $\mu$ is a trade-off parameter. The objective of LDA is
to maximize the projected between-class covariance matrix
(distance between the means of the two classes) and minimize the
projected within-class covariance matrix (total covariance matrix). The
selected weak classifier is guaranteed to achieve this goal. Having
a large projected mean difference and a small projected class
variance indicates that the data can be separated more easily
and, hence, the asymmetric goal can also be achieved more
easily. On the other hand, AdaBoost minimizes a symmetric
exponential loss function that does not guarantee high detection
rates with few false positives [5]. The selected features are
therefore no longer optimal for the task of rejecting negative
samples.
Another way to think of this is that AdaBoost sets the initial
positive and negative sample weights to $0.5/N_p$ and $0.5/N_n$
($N_p$ and $N_n$ are the numbers of positive and negative
samples). The prior information about the number of samples
in each class is then completely lost during training. In
contrast, LDA takes the number of samples in each class into
consideration when solving the optimization problem, i.e., the
number of samples is used in calculating the between-class
covariance matrix ($S_b$). Hence, $S_b$ is the weighted difference
between class mean and sample mean,

$$S_b = \sum_{c_i} N_{c_i} (\mu_{c_i} - \bar{x})(\mu_{c_i} - \bar{x})^\top, \qquad (6)$$

where $\mu_{c_i} = N_{c_i}^{-1} \sum_{j \in c_i} x_j$, $\bar{x} = N^{-1} \sum_{j} x_j$, $N_{c_i}$ is the
number of samples in class $c_i$ and $N$ is the total number
of samples. This extra information minimizes the effect of
the imbalanced data set.
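
For the two-class case, the way the class sizes enter $S_b$ can be written out explicitly. The short derivation below is a sketch that assumes the standard definition $S_b = \sum_{c} N_c (\mu_c - \bar{x})(\mu_c - \bar{x})^\top$ with a positive class $p$ and a negative class $n$; the vector $d$ is notation introduced here only for illustration.

```latex
\begin{align*}
\bar{x} &= \frac{N_p \mu_p + N_n \mu_n}{N}, \qquad d = \mu_p - \mu_n, \\
\mu_p - \bar{x} &= \frac{N_n}{N}\, d, \qquad
\mu_n - \bar{x} = -\frac{N_p}{N}\, d, \\
S_b &= N_p \frac{N_n^2}{N^2}\, d d^\top + N_n \frac{N_p^2}{N^2}\, d d^\top
     = \frac{N_p N_n}{N}\, d d^\top .
\end{align*}
```

Under these assumptions $S_b$ is a rank-one matrix of the form $b b^\top$ with $b = \sqrt{N_p N_n / N}\,(\mu_p - \mu_n)$, which is consistent with the rank-one structure used in Section II-A, and the class sizes $N_p$ and $N_n$ appear explicitly in the criterion.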

In order to demonstrate this, we generate an artificial data 
set similar to one used in Q. We learn a classifier consisting 
of 4 linear classifiers and the results are shown in Fig. 2. From
the figure, we see that the first weak classifier (#1) selected by
both algorithms is the same since it is the only linear classifier
with minimal error. AdaBoost then re-weights the samples
and selects the next classifier (#2), which has the smallest
weighted error. From the figure, the second weak classifier
(#2) introduces more false positives to the final classifier. Since
most positive samples are correctly classified, the positive
samples' weights are close to zero. AdaBoost selects the
next classifier (#3), which classifies all samples as negative.
Therefore it is clear that all but the first weak classifier learned
by AdaBoost are poor because it tries to balance positive and
negative errors. The final combination of these classifiers is
not able to produce high detection rates without introducing
many false positives. In contrast to AdaBoost, GSLDA selects
the second and third weak classifiers (#2, #3) based on the
maximum class separation criterion. Only the linear classifier
whose output yields the maximum distance between the two
classes is selected. As a result, the selected linear classifiers
introduce far fewer false positives (Fig. 2).

In [5], Viola and Jones pointed out the limitation of AdaBoost
in the context of highly skewed example distributions
and proposed a new variant of AdaBoost called AsymBoost,
which is experimentally shown to give a significant performance
improvement over conventional boosting. In brief, the
sample weights are updated before each round of boosting
with an extra exponential term which causes the algorithm to
gradually pay more attention to positive samples in each round
of boosting. Our scheme based on LDA's class-separability criterion
can be considered as an alternative classifier to AsymBoost
that also takes asymmetry information into consideration.

C. Boosted Greedy Sparse Linear Discriminant Analysis 

Before we introduce the concept of BGSLDA, we present 
a brief explanation of boosting algorithms. Boosting is one 
of the most popular learning algorithms. It was originally 
designed for classification problems. It combines the output 
of many weak classifiers to produce a single strong learner. 



Fig. 2. Two examples on a toy data set: (a) AdaBoost classifier; (b) GSLDA
classifier (forward pass). x's and o's represent positive and negative samples,
respectively. Weak classifiers are plotted as lines. The number on the line
indicates the order in which weak classifiers are selected. AdaBoost selects
weak classifiers that attempt to balance weighted positive and negative error.
Notice that AdaBoost's third weak classifier classifies all samples as negative
due to the very small positive sample weights. In contrast, GSLDA selects
weak classifiers based on the maximum class separation criterion. We see
that four weak classifiers of GSLDA model the positives well and most of
the negatives are rejected.

A weak classifier is defined as a classifier whose accuracy on
the training set is greater than that of random guessing. There exist
many variants of boosting algorithms, e.g., AdaBoost (minimizing the
exponential loss), GentleBoost (fitting a regression function by
weighted least squares), LogitBoost (minimizing the
logistic regression cost function) [19], LPBoost (minimizing
the hinge loss) [20], [21], etc. All of them share the
property of sample re-weighting and weighted majority vote.
One of the most widely used boosting algorithms is AdaBoost [22].
AdaBoost is a greedy algorithm that constructs an additive
combination of weak classifiers such that the exponential loss

$$L(y, F(x)) = \exp(-yF(x))$$

is minimized. Here $x$ is a labeled training example and
$y$ is its label; $F(x)$ is the final decision function which
outputs the decided class label. Each training sample receives a
weight $u_i$ that determines its significance for training the next
weak classifier. In each boosting iteration, the value of $\alpha_t$ is
computed and the sample weights are updated according to
the exponential rule. AdaBoost then selects a new hypothesis
$h(\cdot)$ that best classifies the updated training samples with minimal
classification error $e$. The final decision rule $F(\cdot)$ is a linear
combination of the selected weak classifiers weighted by their
coefficients $\alpha_t$. The classifier decision is given by the sign of
the linear combination

$$F(x) = \mathrm{sign}\Big( \sum_{t=1}^{T} \alpha_t h_t(x) \Big),$$

where $\alpha_t$ is a weight coefficient, $h_t(\cdot)$ is a weak learner
and $T$ is the number of weak classifiers. The expression
of the above equation is similar to an expression used in
dimensionality reduction where $F(x)$ can be considered as
the result of linearly projecting the random vector $h(x)$ of weak
classifier outputs onto a
one dimensional space along the direction of $\alpha$.

In the previous section, we have introduced the concept of
GSLDA in the domain of object detection. However, decision
stumps used in the GSLDA algorithm are learned only once to
save computation time. In other words, once learned, an optimal
threshold, which gives the smallest classification error on the
training set, remains unchanged during GSLDA training. This
speeds up the training process, as also shown in the forward feature
selection of [7]. However, it limits the number of decision
stumps available for the GSLDA classifier to choose from. As a
result, the GSLDA algorithm fails to perform at its best. In order
to achieve the best performance from the GSLDA classifier,
we propose to extend the decision stumps used in GSLDA training
with the sample re-weighting techniques used in boosting methods.
In other words, each training sample receives a weight
and a new set of decision stumps is trained according to
these sample weights. The objective criterion used to select the
best decision stump is similar to the one applied in step (1) in
Algorithm 1. Note that step (3) in Algorithm 2 is introduced
in order to speed up the GSLDA training process. In brief, we
remove decision stumps with weighted error larger than $e_k + \varepsilon$,
where $e_k = \frac{1}{2} - \frac{1}{2}\beta_k$, $\beta_k = \max_t \big( \sum_{i=1}^{N} u_i y_i h_t(x_i) \big)$, $N$
is the number of samples, $y_i$ is the class label of sample $x_i$, and
$h_t(x_i)$ is the prediction of the training data $x_i$ using weak
classifier $h_t$. The condition used here has a connection with the
dual constraint of the soft margin LPBoost [20]. The dual
objective of LPBoost minimizes $\beta$ subject to the constraints

$$\sum_{i=1}^{N} u_i y_i h_t(x_i) \le \beta, \quad \forall t,$$

and

$$\sum_{i=1}^{N} u_i = 1, \quad 0 \le u_i \le \text{const}, \quad \forall i.$$

As a result, the sample weights $u_i$ are the most pessimistic ones.
We choose decision stumps with weighted error smaller than
$e_k + \varepsilon$. These decision stumps are the ones that perform best
under the most pessimistic condition.
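
The pruning step can be sketched in a few lines of Python. The sketch assumes that the pruning threshold reads e_k + epsilon with e_k = 1/2 - beta_k/2, that the weights are normalized to sum to one, and that labels and stump outputs take values in {+1, -1}; the function names `edge` and `filter_stumps` are hypothetical.

```python
from typing import List, Sequence


def edge(weights: Sequence[float], labels: Sequence[int],
         preds: Sequence[int]) -> float:
    """sum_i u_i * y_i * h(x_i); equals 1 - 2 * (weighted error)."""
    return sum(u * y * h for u, y, h in zip(weights, labels, preds))


def filter_stumps(weights: Sequence[float], labels: Sequence[int],
                  stump_preds: Sequence[Sequence[int]], eps: float) -> List[int]:
    """Indices of stumps whose weighted error is at most e_k + eps."""
    edges = [edge(weights, labels, p) for p in stump_preds]
    e_k = 0.5 - 0.5 * max(edges)  # smallest weighted error among all stumps
    return [j for j, g in enumerate(edges) if 0.5 - 0.5 * g <= e_k + eps]


# Toy example: three candidate stumps on four equally weighted samples.
u = [0.25, 0.25, 0.25, 0.25]
y = [1, 1, -1, -1]
candidates = [
    [1, 1, -1, -1],   # weighted error 0
    [1, -1, -1, -1],  # weighted error 0.25
    [-1, -1, 1, 1],   # weighted error 1
]
```

A larger epsilon keeps more candidates for the class-separation step at the price of more computation; a smaller one prunes more aggressively.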

Given the set of decision stumps, GSLDA selects the stump
that results in maximum class separation (step (4)). The sample
weights can be updated using different boosting algorithms
(step (5)). In our experiments, we use the AdaBoost [1]
re-weighting scheme (BGSLDA - scheme 1),

$$u_i^{(t+1)} = \frac{u_i^{(t)} \exp(-\alpha_t y_i h_t(x_i))}{Z^{(t+1)}}, \qquad (7)$$

with

$$Z^{(t+1)} = \sum_i u_i^{(t)} \exp(-\alpha_t y_i h_t(x_i)).$$

Here $\alpha_t = \log((1 - e_t)/e_t)$ and $e_t$ is the weighted error.
We also use the AsymBoost [5] re-weighting scheme (BGSLDA
- scheme 2),

$$u_i^{(t+1)} = \frac{u_i^{(t)} \exp(-\alpha_t y_i h_t(x_i)) \exp(y_i \log\sqrt{k})}{Z^{(t+1)}}, \qquad (8)$$

with

$$Z^{(t+1)} = \sum_i u_i^{(t)} \exp(-\alpha_t y_i h_t(x_i)) \exp(y_i \log\sqrt{k}).$$

Algorithm 2 The training algorithm for building a cascade of
BGSLDA object detector.

1 while $F_{\text{target}} < F_i$ do
2     $i = i + 1$;
3     $f_i = 1$;
4     while $f_i > F_{\max}$ do
5         1. Normalize sample weights $u$;
6         2. Train weak classifiers $h(\cdot)$ (e.g., decision stumps by
          finding an optimal threshold $\theta$) using the training set
          and sample weights;
7         3. Remove those weak classifiers with weighted error
          larger than $e_k + \varepsilon$ (Section II-C);
8         4. Add the weak classifier whose output yields the
          maximum class separation;
9         5. Update sample weights $u$ in the AdaBoost manner
          (Eq. (7)) or AsymBoost manner (Eq. (8));
10        6. Lower threshold such that $D_{\min}$ holds;
11        7. Update $f_i$ using this threshold;
12    $D_{i+1} = D_i \times D_{\min}$;
13    $F_{i+1} = F_i \times f_i$; and remove those correctly classified
      negative samples from the training set;
14    if $F_{\text{target}} < F_i$ then
15        Evaluate the current cascaded classifier on the negative
          images and add misclassified samples into the negative
          training set;

Since the BGSLDA based object detection framework has the
same input/output as the GSLDA based detection framework, we
replace lines 2-10 in Algorithm 1 with Algorithm 2.
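
The two re-weighting schemes differ only by the extra asymmetric factor, which a short Python sketch makes visible. It assumes labels and weak classifier outputs in {+1, -1}, a weighted error strictly between 0 and 1, and that the asymmetric factor is exp(y_i log sqrt(k)); the function names and the value k = 4 used in the example are illustrative assumptions.

```python
import math
from typing import List, Sequence


def adaboost_alpha(weights: Sequence[float], labels: Sequence[int],
                   preds: Sequence[int]) -> float:
    """alpha_t = log((1 - e_t) / e_t), with e_t the weighted error.

    Assumes normalized weights and 0 < e_t < 1.
    """
    err = sum(u for u, y, h in zip(weights, labels, preds) if y != h)
    return math.log((1.0 - err) / err)


def reweight(weights: Sequence[float], labels: Sequence[int],
             preds: Sequence[int], alpha: float, k: float = 1.0) -> List[float]:
    """One re-weighting round.

    k = 1 gives the AdaBoost rule; k > 1 adds the AsymBoost factor
    exp(y_i * log(sqrt(k))), which shifts weight towards positives.
    """
    asym = math.log(math.sqrt(k))
    raw = [u * math.exp(-alpha * y * h) * math.exp(y * asym)
           for u, y, h in zip(weights, labels, preds)]
    z = sum(raw)  # normalizer so that the new weights sum to one
    return [r / z for r in raw]


# Toy round: the second (positive) sample is misclassified.
u0 = [0.25, 0.25, 0.25, 0.25]
y0 = [1, 1, -1, -1]
h0 = [1, -1, -1, -1]
a0 = adaboost_alpha(u0, y0, h0)
ada = reweight(u0, y0, h0, a0)
asy = reweight(u0, y0, h0, a0, k=4.0)
```

In this toy round the misclassified sample gains weight under both rules, and the asymmetric rule places a larger share of the total weight on the positive samples than the symmetric one.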

D. Training Time Complexity of BGSLDA 

In order to analyze the complexity of the proposed system,
we need to analyze the complexity of boosting and GSLDA
training. Let the number of training samples in each cascade
layer be $N$. For boosting, finding the optimal threshold of each
feature needs $O(N \log N)$. Assume that the size of the feature
set is $M$ and the number of weak classifiers to be selected
is $T$. The time complexity for training a boosting classifier is
$O(MTN \log N)$. The time complexity for the GSLDA forward
pass is $O(NMT + MT^2)$. $O(N)$ is the time complexity
for finding the mean and variance of each feature. $O(T^2)$ is
the time complexity for calculating the correlation for each feature.
Since we have $M$ features and the number of weak
classifiers to be selected is $T$, the total time complexity
for GSLDA is $O(NMT + MT^2)$. Hence, the total time




Fig. 3. A random sample of face images for training. 

TABLE I

The size of training and test sets used on the single node
classifier.

         # data splits   faces/split   non-faces/split
Train    3               2000          2000
Test     2               2000          2000


complexity is $O(\underbrace{MTN \log N}_{\text{weak classifier}} + \underbrace{NMT + MT^2}_{\text{GSLDA}})$. Since $T$ is
often small (less than 200) in the cascaded structure, the term
$O(MTN \log N)$ often dominates. In other words, most of the
computation time is spent on training weak classifiers.
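
To see why the weak classifier term dominates, the two terms can be compared numerically. The sketch below drops all constant factors and uses a base-2 logarithm; the values N = 10,000, M = 100,000 and T = 200 are illustrative assumptions and not the sizes used in the experiments.

```python
import math
from typing import Tuple


def training_cost_terms(n_samples: int, n_features: int,
                        n_selected: int) -> Tuple[float, float]:
    """Return (weak_classifier_term, gslda_term) with constants dropped."""
    weak = n_features * n_selected * n_samples * math.log2(n_samples)
    gslda = n_samples * n_features * n_selected + n_features * n_selected ** 2
    return weak, gslda


# Hypothetical problem size, chosen only to illustrate the orders of magnitude.
weak_term, gslda_term = training_cost_terms(
    n_samples=10_000, n_features=100_000, n_selected=200)
ratio = weak_term / gslda_term
```

Because the ratio reduces to N log N / (N + T), it grows roughly like log N whenever T is much smaller than N, which matches the observation that most of the time is spent on training weak classifiers.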



III. Experiments 

This section is organized as follows. The datasets used in 
this experiment, including how the performance is analyzed, 
are described. Experiments and the parameters used are then 
discussed. Finally, experimental results and analysis of differ- 
ent techniques are presented. 

A. Face Detection with the GSLDA Classifier 

Due to their efficiency, Haar-like rectangle features [1] have
become a popular choice as image features in the context of
face detection. Similar to the work in [1], the weak learning
algorithm known as the decision stump and Haar-like rectangle
features are used here due to their simplicity and efficiency.
The following experiments compare the AdaBoost and GSLDA
learning algorithms in terms of their performance in the domain of face
detection. For fast AdaBoost training of Haar-like rectangle
features, we apply a pre-computing technique similar to [7].

1) Performances on Single-node Classifiers: This experiment
compares single strong classifiers learned using the AdaBoost
and GSLDA algorithms in terms of their classification performance.
The datasets consist of three training sets and two test sets.
Each training set contains 2,000 face examples and 2,000
non-face examples (Table I). The dataset consists of 10,000
mirrored faces. The faces were cropped and rescaled to images
of size 24 × 24 pixels. For non-face examples, we randomly
selected 10,000 non-face patches from non-face images
obtained from the internet. Fig. 3 shows a random sample
of face training images.

For each experiment, three different classifiers are generated,
each by selecting two out of the three training sets for training and
the remaining training set for validation. The performance is
measured by two different curves: the test error rate and the
classifier learning goal (the false alarm error rate on the test set
given that the detection rate on the validation set is fixed
at 99%). A 95% confidence interval of the true mean error
rate is given by the t-distribution. In this experiment, we test
two different approaches of GSLDA: forward-pass GSLDA
and dual-pass (forward+backward) GSLDA. The results are
shown in Fig. 4. The following observations can be made from
these curves. Having the same number of learned Haar-like
rectangle features, GSLDA achieves a comparable error rate
to AdaBoost on the test sets (Fig. 4(a)). GSLDA seems to perform
slightly better with fewer Haar-like features (< 100)
while AdaBoost seems to perform slightly better with more
Haar-like features (> 100). However, both classifiers perform
similarly within the 95% confidence interval of the true
error rate. This indicates that features selected using the GSLDA
classifier are as meaningful as features selected using the AdaBoost
classifier. From the curve, GSLDA with bi-directional
search yields better results than GSLDA with forward search
only. Fig. 4(b) shows the false positive error rate on the test
set. From the figure, both GSLDA and AdaBoost achieve a
comparable false positive error rate on the test set.

2) Performances on Cascades of Strong Classifiers: In this
experiment, we used 5,000 mirrored faces from the previous
experiment. The non-face samples used in each cascade layer
are collected from false positives of the previous stages of
the cascade (bootstrapping). The cascade training algorithm
terminates when there are not enough negative samples to
bootstrap. For a fair evaluation, we trained both techniques
with the same number of weak classifiers in each cascade stage.
Note that since dual-pass GSLDA (forward+backward search)
yields better solutions than the forward search in the previous
experiment, we use the dual-pass GSLDA classifier to train a
cascade of face detectors. We tested our face detectors on the
low-resolution face dataset, the MIT+CMU frontal face test set.
The complete set contains 130 images with 507 frontal faces.
In this experiment, we set the scaling factor to 1.2 and the window
shifting step to 1. The technique used for merging overlapping
windows is similar to [1]. Detections are considered true or
false positives based on the area of overlap with ground truth
bounding boxes. To be considered a correct detection, the area
of overlap between the predicted bounding box and the ground
truth bounding box must exceed 50%. Multiple detections of
the same face in an image are considered false detections.
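
The evaluation rule can be illustrated with a minimal sketch. It is
not the evaluation code used in the experiments: boxes are assumed to
be (x, y, width, height) tuples, the "area of overlap" is assumed to be
measured as intersection over union, and the function names
`overlap_ratio` and `score_detections` are hypothetical.

```python
def overlap_ratio(a, b):
    """Intersection over union of two (x, y, w, h) boxes (assumed overlap measure)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def score_detections(detections, ground_truth, min_overlap=0.5):
    """Count true and false positives for one image.

    A detection is correct if it overlaps a not yet matched ground truth
    box by more than min_overlap. Further detections of an already
    matched face are counted as false positives.
    """
    matched = set()
    true_pos = false_pos = 0
    for det in detections:
        hit = None
        for idx, gt in enumerate(ground_truth):
            if idx not in matched and overlap_ratio(det, gt) > min_overlap:
                hit = idx
                break
        if hit is None:
            false_pos += 1
        else:
            matched.add(hit)
            true_pos += 1
    return true_pos, false_pos
```

Under these assumptions, a second window on an already detected face
increases the false positive count rather than the detection count.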



Figs. 5(a) and 5(b) show a comparison between the Receiver
Operating Characteristic (ROC) curves produced by the GSLDA
classifier and the AdaBoost classifier. In Fig. 5(a), the number
of weak classifiers in each cascade stage is predetermined,
while in Fig. 5(b), weak classifiers are added to the cascade
until the predefined objective is met. The ROC curves show
that the GSLDA classifier outperforms the AdaBoost classifier at all
false positive rates. We think that by adjusting the threshold
of the AdaBoost classifier (in order to achieve high detection
rates with moderate false positive rates), the performance of
AdaBoost is no longer optimal. Our findings in this work are
consistent with the experimental results reported in [5] and
[7]. [7] used LDA weights instead of the weak classifiers' weights
provided by the AdaBoost algorithm.

GSLDA not only performs better than AdaBoost but is
also much simpler. Weak classifier learning (decision stumps)
is performed only once for the given set of samples (unlike
AdaBoost, where weak classifiers have to be re-trained in each
boosting iteration). The GSLDA algorithm sequentially selects
the decision stump whose output yields the maximum eigenvalue.
The process continues until the stopping criteria are met. Note
that given the decision stumps selected by GSLDA, any linear
classifier can be used to calculate the weight coefficients.
Based on our experiments, using a linear SVM (maximizing the
minimum margin) instead of LDA also gives a result very similar
to that of our GSLDA detector. We believe that using one
objective criterion for feature selection and another criterion
for classifier construction would provide a classifier with more
flexibility than using the same criterion to select features and
train weight coefficients. These findings open up many more
possibilities for combining various feature selection techniques
with many existing classification techniques. We believe that
a better and faster object detector can be built with careful
design and experiment.
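
The selection step can be sketched as follows. This is a simplified
illustration under stated assumptions, not the implementation used in
our experiments: it performs the forward pass only (no backward
elimination), assumes the decision stump outputs have already been
computed once and stored in a matrix `H` (samples by stumps) with
labels in {+1, -1}, and adds a small `ridge` term to keep the
within-class matrix invertible. The function names are hypothetical.
For two classes the between-class matrix has rank one, so the largest
generalized eigenvalue reduces to d^T S_w^{-1} d, where d is the
difference of the class means.

```python
import numpy as np


def class_separation(H, y, idx, ridge=1e-6):
    """Largest generalized eigenvalue of (S_b, S_w) on the stump columns idx.

    Returns the eigenvalue d^T S_w^{-1} d and the matching LDA weights
    S_w^{-1} d, where d is the difference of the class means.
    """
    pos, neg = H[y == 1][:, idx], H[y == -1][:, idx]
    d = pos.mean(axis=0) - neg.mean(axis=0)
    s_w = np.cov(pos, rowvar=False, bias=True) + np.cov(neg, rowvar=False, bias=True)
    s_w = np.atleast_2d(s_w) + ridge * np.eye(len(idx))  # ridge is an assumption
    w = np.linalg.solve(s_w, d)
    return float(d @ w), w


def greedy_forward_selection(H, y, n_select):
    """Add, one at a time, the stump that yields the maximum eigenvalue."""
    selected = []
    for _ in range(n_select):
        remaining = [j for j in range(H.shape[1]) if j not in selected]
        best = max(remaining, key=lambda j: class_separation(H, y, selected + [j])[0])
        selected.append(best)
    _, weights = class_separation(H, y, selected)
    return selected, weights
```

Because the stump outputs in `H` are fixed, the final weights could
equally be obtained from another linear classifier trained on the
selected columns.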

Haar-like rectangle features selected in the first cascade
layer of both classifiers are shown in Fig. 7. Note that
both classifiers select Haar-like features which cover the area
around the eyes and forehead. Table II compares the two
cascaded classifiers in terms of the number of weak classifiers
and the average number of Haar-like rectangle features
evaluated per detection window. Comparing GSLDA with
AdaBoost, we found that the performance gain of GSLDA comes at
the cost of a higher computation time. This is not surprising
since the number of decision stumps available for training the
GSLDA classifier is much smaller than the number of decision
stumps used in training the AdaBoost classifier. Hence, the AdaBoost
classifier can choose a more powerful/meaningful decision
stump. Nevertheless, the GSLDA classifier outperforms the AdaBoost
classifier. This indicates that a classifier trained to maximize
class separation might be more suitable in domains where
the distribution of positive and negative samples is highly
skewed. In the next section, we conduct an experiment on
BGSLDA.
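
To make the cost measure concrete, the average number of features
evaluated per window can be estimated from the stage sizes and the
fraction of windows each stage passes on. The sketch below is an
illustration only; the function name and the example numbers are
hypothetical and are not taken from Table II.

```python
def expected_features_per_window(stage_sizes, pass_rates):
    """Average number of weak classifiers evaluated per detection window.

    stage_sizes[i] is the number of weak classifiers in stage i and
    pass_rates[i] is the fraction of windows reaching stage i that
    stage i passes on to stage i + 1.
    """
    total, reach = 0.0, 1.0
    for size, rate in zip(stage_sizes, pass_rates):
        total += reach * size  # every window reaching this stage pays its full cost
        reach *= rate          # only the accepted windows continue
    return total
```

Since most windows are rejected by the first few stages, the size of
the early stages dominates this average.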



B. Face Detection with BGSLDA Classifiers

The following experiments compare the face detection performance
of BGSLDA with that of different boosting learning algorithms.
BGSLDA (weight scheme 1) corresponds to the
GSLDA classifier with decision stumps being re-weighted
using the AdaBoost scheme, while BGSLDA (weight scheme 2)
corresponds to the GSLDA classifier with decision stumps being
re-weighted using the AsymBoost scheme (for highly skewed
sample distributions). The AsymBoost used in this experiment is
from [5]. However, any asymmetric boosting approach can be
applied here e.g. 1231 . iBTI . 

1) Performances on Single Node Classifiers: The experimental
setup is similar to the one described in the previous
section. The results are shown in Fig. 4. The following
conclusions can be made from Fig. 4(c). Given the same
number of weak classifiers, BGSLDA always achieves a lower
generalization error rate than AdaBoost. However, in terms of
training error, AdaBoost achieves a lower training error rate than
BGSLDA. This is not surprising since AdaBoost has a faster
convergence rate than BGSLDA. From the figure, AdaBoost
only achieves a lower training error rate than BGSLDA when
the number of Haar-like rectangle features is > 50. Fig. 4(d)
shows the false alarm error rate. The false positive error rates
of both classifiers are quite similar.

2) Performances on Cascades of Strong Classifiers: The
experimental setup and evaluation techniques used here are
similar to those described in Section III-A2. The results
are shown in Fig. 5. Fig. 5(a) shows a comparison between
the ROC curves produced by the BGSLDA (scheme 1) classifier
and the AdaBoost classifier trained with the same number of
weak classifiers in each cascade stage. Both ROC curves show that
the BGSLDA classifier outperforms both AdaBoost and
AdaBoost+LDA [7]. Fig. 5(b) shows a comparison between the
ROC curves of different classifiers when the number of weak
classifiers in each cascade stage is no longer predetermined.
At each stage, weak classifiers are added until the predefined
objective is met. Again, BGSLDA significantly outperforms the
other evaluated classifiers. Fig. 8 shows some face
detection results of our BGSLDA (scheme 1) detector.

In the next experiment, we compare the performance of
BGSLDA (scheme 2) with other classifiers using the asymmetric
weight updating rule [5]. In other words, the asymmetric
multiplier $\exp\left(\frac{1}{N} y_i \log\sqrt{k}\right)$ is applied to every sample before
each round of weak classifier training. The results are shown
in Fig. 6. Fig. 6(a) shows a comparison between the ROC
curves trained with the same number of weak classifiers in
each cascade stage. Fig. 6(b) shows the ROC curves trained
with 99.5% detection rate and 50% false positive rate criteria.
In both figures, the BGSLDA (scheme 2) classifier outperforms
the other classifiers evaluated. The BGSLDA (scheme 2) classifier also
outperforms the BGSLDA (scheme 1) classifier. This indicates
that an asymmetric loss might be more suitable in domains where
the distribution of positive examples and negative examples is
highly imbalanced. Note that the performance gain between
BGSLDA (scheme 1) and BGSLDA (scheme 2) is quite small
compared with the performance gain between AdaBoost and
AsymBoost. Since LDA takes the number of samples of
each class into consideration when solving the optimization
problem, we believe this reduces the performance gap between
BGSLDA (scheme 1) and BGSLDA (scheme 2).
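
A minimal sketch of the asymmetric re-weighting step follows. It
assumes the usual AsymBoost formulation of [5], in which `k` is the
asymmetry ratio (how much more a missed positive costs than a false
positive), `n_rounds` is the total number of boosting rounds and labels
are in {+1, -1}. The function and parameter names are hypothetical,
and renormalizing the weights after scaling is also an assumption.

```python
import math


def apply_asymmetric_multiplier(weights, labels, k, n_rounds):
    """Scale each weight by exp((1 / n_rounds) * y * log(sqrt(k))), then renormalize."""
    step = math.log(math.sqrt(k)) / n_rounds
    scaled = [w * math.exp(y * step) for w, y in zip(weights, labels)]
    total = sum(scaled)
    return [w / total for w in scaled]
```

Applying the multiplier in every round spreads the asymmetry over the
whole training run instead of letting the first weak classifier absorb
it.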

Table II indicates that our BGSLDA (scheme 1) classifier
performs at a speed comparable to the AdaBoost classifier. However,
compared with AdaBoost+LDA, the performance gain
of BGSLDA comes at a slightly higher cost in computation
time. In terms of cascade training time, on a desktop with an
Intel Core 2 Duo T7300 CPU and 4 GB of RAM, the total
training time is less than one day.

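As mentioned in [24], a more general technique for generating
discriminating hyperplanes is to define the total within-class
covariance matrix as

$$S_w = \sum_{x \in \mathcal{C}_1} (x - \mu_1)(x - \mu_1)^\top + \gamma \sum_{x \in \mathcal{C}_2} (x - \mu_2)(x - \mu_2)^\top, \qquad (9)$$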

where $\mu_1$ is the mean of class 1 and $\mu_2$ is the mean of
class 2. The weighting parameter $\gamma$ controls the weighted
classification error. We have conducted an experiment on
BGSLDA (scheme 1) with different values of $\gamma$, namely
Fig. 4. See text for details (best viewed in color). (a) Comparison of test error rates between GSLDA and AdaBoost. (b) Comparison of false alarm rates
on test set between GSLDA and AdaBoost. The detection rate on the validated face set is fixed at 99%. (c) Comparison of train and test error rates between
BGSLDA (scheme 1) and AdaBoost. (d) Comparison of false alarm rates on test set between BGSLDA (scheme 1) and AdaBoost.




Fig. 5. Comparison of ROC curves on the MIT+CMU face test set (a) with the same number of weak classifiers in each cascade stage on AdaBoost and its 
variants, (b) with 99.5% detection rate and 50% false positive rate in each cascade stage on AdaBoost and its variants. BGSLDA (scheme 1) corresponds to 
GSLDA classifier with decision stumps being re-weighted using AdaBoost scheme. 



$\gamma \in \{0.1, 0.5, 1.0, 2.0, 10.0\}$. All the other experiment settings
remain the same as described in the previous section. The
results are shown in Fig. 9. Based on the ROC curves, it can be
seen that all configurations of BGSLDA classifiers outperform
Fig. 6. Comparison of ROC curves on the MIT+CMU face test set (a) with the same number of weak classifiers in each cascade stage on AsymBoost and its
variants, (b) with 99.5% detection rate and 50% false positive rate in each cascade stage on AsymBoost and its variants. BGSLDA (scheme 2) corresponds
to GSLDA classifier with decision stumps being re-weighted using AsymBoost scheme.



AdaBoost             0.55  0.47  0.38  0.35  0.26  0.24  0.24
GSLDA                0.43  0.47  0.40  0.36  0.30  0.36  0.25
AdaBoost+LDA         0.52  0.33  0.52  0.32  0.23  0.26  0.32
BGSLDA (scheme 1)    0.44  0.43  0.39  0.31  0.42  0.27  0.32

Fig. 7. The first seven Haar-like rectangle features selected from the
first layer of the cascade. The value below each Haar-like rectangle feature
indicates the normalized feature weight. For AdaBoost, the value corresponds
to the normalized $\alpha$, where $\alpha$ is computed from $\log((1 - e_t)/e_t)$ and $e_t$
is the weighted error. For LDA, the value corresponds to the normalized $w$
such that for input vector $x$ and a class label $y$, $w^\top x$ leads to maximum
separation between two classes.



AdaBoost classifier at all false positive rates. Setting $\gamma = 1$
gives the highest detection rates when the number of false
positives is larger than 200. Setting $\gamma = 0.5$ performs best
when the number of false positives is very small.
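
The effect of the weighting parameter can be sketched in a few lines.
The sketch assumes that the weighted within-class matrix takes the form
S_w = S_1 + gamma * S_2, where S_c is the scatter of class c about its
mean. Which class carries the weight is an assumption here, and the
function names are hypothetical.

```python
import numpy as np


def weighted_within_class_scatter(x1, x2, gamma):
    """S_w = S_1 + gamma * S_2, with S_c the scatter of class c about its mean."""
    d1 = x1 - x1.mean(axis=0)
    d2 = x2 - x2.mean(axis=0)
    return d1.T @ d1 + gamma * (d2.T @ d2)


def lda_direction(x1, x2, gamma):
    """Projection vector proportional to S_w^{-1} (mu_1 - mu_2)."""
    s_w = weighted_within_class_scatter(x1, x2, gamma)
    return np.linalg.solve(s_w, x1.mean(axis=0) - x2.mean(axis=0))
```

With gamma = 1 this reduces to the ordinary two-class LDA direction.
Other values change how strongly the spread of the second class
influences the hyperplane.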

C. Pedestrian Detection with GSLDA and BGSLDA Classifiers 

In this section, we apply the proposed algorithm to pedestrian
detection, which is considered a more difficult problem
than face detection.

1) Pedestrian Detection on the Daimler-Chrysler dataset
with Haar-like Features: In this experiment, we evaluate the
performance of our techniques on the Daimler-Chrysler pedestrian
dataset [25]. The dataset contains a set of extracted pedestrian
and non-pedestrian samples which are scaled to size 18 x 36
pixels. The dataset consists of three training sets and two test
sets. Each training set contains 4,800 pedestrian examples and
5,000 non-pedestrian examples. Performance on the test sets is
analyzed similarly to the techniques described in [25]. For each
experiment, three different classifiers are generated. Testing
all three classifiers on two test sets yields six different ROC
curves. A 95% confidence interval of the true mean detection
rate is given by the t-distribution. We conducted three experiments
using Haar-like features trained with three different
classifiers: AdaBoost, GSLDA and BGSLDA (scheme 1). The
experimental setup is similar to the previous experiments.
Fig. 10 shows detection results of different classifiers.
Again, the ROC curves show that LDA classifier outperforms
AdaBoost classifier at all false positive rates. Clearly these
curves are consistent with those on face datasets.
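
The confidence interval computation can be sketched as follows,
assuming that at a given false positive rate the six ROC curves provide
six detection rates, so that the interval uses Student's t with five
degrees of freedom. The function name is hypothetical, and only the
standard two-sided 95% critical values for one to five degrees of
freedom are tabulated.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, indexed by degrees of freedom.
T_CRITICAL_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def mean_confidence_interval(rates):
    """95% confidence interval of the true mean from a small sample of rates."""
    n = len(rates)
    mean = statistics.mean(rates)
    half_width = T_CRITICAL_95[n - 1] * statistics.stdev(rates) / math.sqrt(n)
    return mean - half_width, mean + half_width
```

The interval narrows as the six curves agree more closely, which is
what makes non-overlapping intervals between two classifiers
informative.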

2) Pedestrian Detection on the INRIA dataset with Covariance
Features: We also conduct experiments on the INRIA
pedestrian dataset. We compare the performance of our
method with other state-of-the-art results. The INRIA dataset
[26] consists of one training set and one test set. The training
set contains 2,416 mirrored pedestrian examples and 1,200
non-pedestrian images. The pedestrian samples were obtained
by manually labeling images taken at various times of the
day and at various locations. The pedestrian samples are mostly
in a standing position. A border of 8 pixels is added to each
sample in order to preserve contour information. All samples
are scaled to size 64 x 128 pixels. The test set contains 1,176
mirrored pedestrian examples extracted from 288 images and
453 non-pedestrian test images.

Since Haar-like features perform poorly on this dataset, we
apply covariance features instead of Haar-like features [27],
[18]. However, decision stumps cannot be directly applied
since the algorithm is not applicable to multi-dimensional data.
To overcome this problem, we apply LDA that projects a
TABLE II

Comparison of number of weak classifiers. The number of cascade stages and total weak classifiers were obtained from the
classifiers trained to achieve a detection rate of 99.5% and the maximum false positive rate of 50% in each cascade layer. The
average number of Haar-like rectangles evaluated was obtained from evaluating the trained classifiers on the MIT+CMU face test
set.

method               number of stages   total number of weak classifiers   average number of Haar features evaluated
AdaBoost [1]         22                 1771                               23.9
AdaBoost+LDA [7]     22                 1436                               22.3
GSLDA                24                 2985                               36.0
BGSLDA (scheme 1)    23                 1696                               24.2
AsymBoost [5]        22                 1650                               22.6
AsymBoost+LDA [7]    22                 1542                               21.5
BGSLDA (scheme 2)    23                 1621                               24.9


Algorithm 3 The algorithm for training multi-dimensional
features.

foreach multi-dimensional feature do
    1. Calculate the projection vector with LDA and project the
       multi-dimensional feature onto 1D space;
    2. Train decision stump classifiers to find an optimal
       threshold using the positive and negative training sets;



performs very similarly to existing covariance techniques [27],
[18] at low false positive rates (lower than 10~^). This method,
however, seems to perform poorly at high false positive rates.
Nonetheless, most real-world applications focus on low
false detection rates. Compared to boosted covariance features, the
training time of cascade classifiers is reduced from weeks to
days on a standard PC.



multi-dimensional data onto a 1D space first. In brief, we
stack covariance features and project them onto a 1D space.
Decision stumps are then applied as weak classifiers. Our
training technique is different from [18]. [18] applied
AdaBoost with weighted linear discriminant analysis (WLDA)
as weak classifiers. The major drawback of [18] is its slow
training time. Since each training sample is assigned a weight,
the weak classifiers (WLDA) need to be trained T times, where
T is the number of boosting iterations. In this experiment,
we train the weak classifiers (LDA) only once and store their
projected results in a table. Because most of the training time
in [18] is used to train WLDA, the new technique requires only
$1/T$ of the training time of [18]. After we project the
multi-dimensional covariance features onto a 1D space using LDA,
we train decision stumps on these 1D features. In other words,
we replace lines 4 and 5 in Algorithm 1 with Algorithm 3.
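
The two steps of Algorithm 3 can be sketched as follows. This is an
illustration under stated assumptions rather than our implementation:
each multi-dimensional feature is given as a pair of arrays (positive
and negative samples, at least two dimensions each), labels are in
{+1, -1}, the `ridge` regularizer is an assumption, and all function
names are hypothetical.

```python
import numpy as np


def lda_projection(pos, neg, ridge=1e-6):
    """Projection vector S_w^{-1} (mu_pos - mu_neg) for one feature."""
    s_w = np.cov(pos, rowvar=False) + np.cov(neg, rowvar=False)
    s_w = s_w + ridge * np.eye(pos.shape[1])  # ridge is an assumption
    return np.linalg.solve(s_w, pos.mean(axis=0) - neg.mean(axis=0))


def project_once(features_pos, features_neg):
    """Run LDA once per multi-dimensional feature and cache the 1D values."""
    table = []
    for pos, neg in zip(features_pos, features_neg):
        w = lda_projection(pos, neg)
        table.append(np.concatenate([pos @ w, neg @ w]))
    return np.stack(table, axis=1)  # shape: (n_samples, n_features)


def train_stump(values, labels, weights):
    """Threshold and polarity with the lowest weighted error on 1D values."""
    best = (np.inf, 0.0, 1)
    for threshold in np.unique(values):
        for polarity in (1, -1):
            pred = np.where(polarity * (values - threshold) >= 0, 1, -1)
            err = weights[pred != labels].sum()
            if err < best[0]:
                best = (float(err), float(threshold), polarity)
    return best
```

The table returned by `project_once` is computed a single time. Only
`train_stump`, which works on scalars, depends on the sample weights,
which is where the saving over re-training a weighted projection in
every round comes from.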

In this experiment, we generate a set of over-complete
rectangular covariance filters and subsample the over-complete
set in order to keep a manageable set for the training phase.
The set contains approximately 45,675 covariance filters. In
each stage, weak classifiers are added until the predefined
objective is met. We set the minimum detection rate to
99.5% and the maximum false positive rate to 35% in each
stage. The cascade threshold value is then adjusted such that
the cascade rejects 50% of the negative samples on the training sets.
Each stage is trained with 2,416 pedestrian samples and 2,500
non-pedestrian samples. The negative samples used in each
stage of the cascade are collected from false positives of the
previous stages of the cascade.
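
The per-stage targets translate into overall cascade rates in the usual
way. The sketch below assumes that the stages behave independently, so
that the per-stage rates multiply. The function names are hypothetical
and the numbers in the usage note are illustrative, not measured
results.

```python
import math


def cascade_rates(stage_detection, stage_false_positive, n_stages):
    """Overall detection and false positive rates after n_stages stages."""
    return stage_detection ** n_stages, stage_false_positive ** n_stages


def stages_needed(stage_false_positive, target_false_positive):
    """Smallest number of stages whose combined false positive rate meets the target."""
    return math.ceil(math.log(target_false_positive) / math.log(stage_false_positive))
```

For example, with a per-stage detection rate of 99.5% and a per-stage
rejection of 50% of the negatives, `stages_needed(0.5, 1e-6)` gives 20
stages, and `cascade_rates(0.995, 0.5, 20)` gives an overall detection
rate of roughly 90%.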



Fig. 11 shows a comparison of our experimental results on
learning 1D covariance features using AdaBoost and GSLDA.
The ROC curve is generated by adding one cascade level
at a time. From the curve, the GSLDA classifier outperforms the
AdaBoost classifier at all false positive rates. The results
seem to be consistent with our results reported earlier on face
detection. On closer observation, our simplified technique



IV. Conclusion 

In this work, we have proposed an alternative approach
in the context of visual object detection. The core of the
new framework is greedy sparse linear discriminant analysis
(GSLDA) [2], which aims to maximize the class-separation
criterion. On various datasets for face detection and pedestrian
detection, we have shown that this technique outperforms
AdaBoost when the distribution of positive and negative samples
is highly skewed. To further improve the detection result, we
have proposed a boosted version of GSLDA, which combines
the boosting re-weighting scheme with the decision stumps used for
training the GSLDA algorithm. Our extensive experimental
results show that the performance of BGSLDA is better than
that of AdaBoost at a similar computation cost.

Future work will focus on the search for more efficient weak
classifiers and on online updating of the learned model.
Fig. 10. Pedestrian detection performance comparison on the
Daimler-Chrysler pedestrian dataset [25].
features (projected covariance) trained using AdaBoost and GSLDA on the
INRIA dataset [26].